1 Introduction

The success of deep learning models has typically been measured in terms of their predictive power, but they have lacked a principled way of expressing uncertainty about these predictions.

In my master’s thesis we apply recent insights connecting dropout neural networks to Bayesian approximate variational inference (VI). VI is a technique for approximating the intractable integrals that arise when modelling the posterior distribution in Bayesian inference and machine learning. Gal et al. [1, 2, 3] have shown that the posterior predictive distribution of a neural network (NN) can be approximated by leaving dropout on at test time and sampling multiple predictions. This amounts to drawing Monte Carlo (MC) samples from the posterior distribution (a brief description of this process is given in section 1.1). The Bayesian framework allows us to obtain principled uncertainty estimates when making predictions with these so-called MC dropout NNs.

The results of [1, 2, 3] seem particularly promising when applied to healthcare. In fact, a recent paper published in Scientific Reports [4] applies MC dropout to capture uncertainty estimates when predicting the presence of diabetic retinopathy in patients. The paper demonstrates how uncertainty estimates can provide useful information to the clinician tasked with interpreting medical images. The key idea is that this human-machine interaction will lead to better overall results than either could produce individually.

In this notebook I will first briefly introduce the general idea behind MC dropout, and then demonstrate through a practical example how it has been applied in the existing literature. We will use the fastai framework, and the goal is to estimate uncertainty on a classification task using the labelled images in the CIFAR-10 data set.

1.1 Background

A neural network is made up of many neurons which are organized in layers. Neurons are often called nodes or units, depending on your choice of machine learning literature. Henceforth we will refer to them as units.

In a typical neural network there are very many units. As a consequence, the number of parameters often greatly exceeds the number of data points, dramatically increasing the risk of overfitting (i.e. there are many settings of the weights which will cause the network to fit the data almost perfectly).

Dropout is a common stochastic regularisation technique [5] used to prevent overfitting in neural networks. The term dropout refers to randomly “dropping out” non-output units, temporarily removing all of their connections to the rest of the network. The main idea is that if the presence of other units is unreliable, each unit is forced to learn to be useful on its own. At test time all units are active, and the weights of the network are scaled by the retention probability \(1-p\) (where \(p\) is the dropout rate) so that the expected output matches the expected output during training.
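
To make the scaling step concrete, the following is a minimal numpy sketch (illustrative only; the layer size, weights and dropout rate are made up, and this is not code from the model used later):

import numpy as np

rng = np.random.default_rng(0)
p = 0.5                               # dropout rate (probability of dropping a unit)
h = rng.normal(size=200)              # activations of a hidden layer (made-up values)
w = rng.normal(size=200)              # outgoing weights of those units

# Training: each unit is dropped independently with probability p.
mask = rng.random(200) >= p
train_output = np.dot(w, h * mask)

# Test: all units are active, but the weights are scaled by the retention
# probability (1 - p), so that the expectation of train_output over dropout
# masks equals test_output.
test_output = np.dot(w * (1 - p), h)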

Recent work by Gal et al. [1, 2, 3] casts dropout neural networks as approximate Bayesian inference. Their results show that the predictive posterior distribution of a neural network can be approximated by leaving dropout on at test time.

Consider a classification setting, such as in CIFAR-10. Essentially, what happens when applying MC dropout is the following:

  • An image is fed forward through the network \(T\) times (we use \(T=100\) as recommended in [4]). Each time the image is fed through is called a stochastic forward pass.
  • For each stochastic forward pass, a slightly different network is making predictions because dropout has randomly switched off units.
  • As a result, the \(T=100\) stochastic forward passes return 100 slightly different vectors of class probabilities.
  • To make a prediction we average the 100 vectors. This is called the predictive mean vector, denoted by \(\hat{\mu}\). The class corresponding to its largest element is our final prediction, i.e. \(\hat{y} = \text{argmax}_k\, \hat{\mu}_k\).
  • Finally we calculate the standard deviation of the softmax outputs over all forward passes. This is our estimated uncertainty. (A minimal code sketch of this procedure follows the list.)
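
As a minimal sketch of this procedure in PyTorch (the model below is a stand-in with a single dropout layer, not the LeNet-5 used later; names and shapes are illustrative only):

import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in model with one dropout layer, and a single random stand-in "image".
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 100), nn.ReLU(),
                      nn.Dropout(p=0.5), nn.Linear(100, 10))
x = torch.randn(1, 3, 32, 32)

T = 100                                    # number of stochastic forward passes
model.train()                              # keep dropout active at prediction time: this is MC dropout

with torch.no_grad():
    probs = torch.stack([F.softmax(model(x), dim=1).squeeze(0) for _ in range(T)])  # shape (T, 10)

mu_hat = probs.mean(dim=0)                 # predictive mean vector
prediction = mu_hat.argmax().item()        # final class prediction
sigma_hat = probs.std(dim=0)[prediction]   # estimated uncertainty of the predicted class

The only difference from ordinary test-time prediction is that the dropout layers are left active, so each of the \(T\) passes samples a slightly different sub-network.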

Gal et al. [3] show that the above amounts to drawing Monte Carlo samples from the predictive posterior. Their work demonstrates that applying dropout is effectively the same as defining a Bayesian neural network with a Bernoulli approximating variational distribution over the network’s weights. Gal has written an excellent blog post that introduces the work and the derivation.

In practical applications [4], model uncertainty is approximated by the empirical standard deviation of the predictions for class \(k\), i.e. \[\hat{\sigma}_k = \sqrt{\frac{\sum_{t=1}^T [p_t(k|X,w) - \hat{\mu}_k]^2}{T-1}}\] where \[\hat{\mu}_k = \frac{1}{T}\sum_{t=1}^T p_t(k|X,w)\] is the average of the softmax outputs for the predicted class. This is a somewhat ad hoc approximation, but recent research suggests that \(\hat{\sigma}_k\) is a meaningful uncertainty estimate [7].

1.2 Problem statement

The derivation in Gal et al. [2] is based on the observation that a shallow neural network with infinitely many hidden units converges to a Gaussian process (GP) with a specific covariance function. A GP is a non-parametric, probabilistic machine learning method. The idea is that we place a prior distribution over the function space, and by observing new data points we can figure out which function is most likely to have generated the observed data. A GP is fully specified by its mean and covariance functions, and allows us to obtain uncertainty estimates [1]. The extension of this result to dropout neural networks is the main contribution of Gal et al. [2].

In [1], however, Gal et al. extend this idea further to include priors over the weights of the convolutional layers in a convolutional neural network (CNN). A CNN is a specialized type of neural network used primarily, and very effectively, for image analysis. The authors briefly state that the GP interpretation is lost when the model is extended to CNNs, but that these networks can still be “modelled as Bayesian”.

As far as we are aware, there have been no efforts to determine the correlation between the approximated uncertainty and a network’s ability to predict correctly. Moreover, the work done so far seems to focus on the uncertainty associated with the predicted class. In other words, our problem statement is:

Are the uncertainty approximations obtained by applying Monte Carlo dropout to convolutional neural networks a reasonable measure of model uncertainty?

We will also examine if there is any additional information to be gained from establishing a connection between the prediction and the runner-up prediction.

2 Setup

In this notebook we use the following R packages: tidyverse and ggplot2. In addition we use the multiplot functionality provided by Cookbook for R. In Python we use fastai, numpy and pandas. Full code for training and inference is available on GitHub. All code for the exploratory data analysis is included in this document.

2.1 Experimental setup

State-of-the-art architectures such as ResNet and DenseNet are very powerful, but they are also complicated and their inner workings are quite convoluted. We are primarily interested in examining the correlation between uncertainty estimation and predictive capabilities. It is arguably better to use a simple network architecture to illustrate the idea of MC dropout. In our approach we use LeNet-5. Click on the tabs for implementation details. Full code is available on GitHub. In the following we assume basic familiarity with NNs and specifically CNNs. NOTE: Click on Code-tab to expand chunks.

2.1.1 Model

LeNet-5 was a pioneering 7-layer convolutional neural network originally developed by Yann LeCun in 1998 for handwritten digit recognition. It is hopelessly primitive compared to contemporary architectures, but is very useful for experimentation and proof-of-concept work due to its simplicity. The following chunk shows the model architecture:

import torch.nn as nn
import torch.nn.functional as F

# conv_size (kernel size) and p (dropout rate) are set elsewhere in the notebook;
# we use conv_size = 5 and p = 0.5 (see section 2.2). DropoutMC is defined in section 2.1.2.
class lenet_all(nn.Module):
    def __init__(self, conv_size=conv_size, pool_size=2, drop_rate=p):
        super().__init__()
        self.drop_rate = drop_rate
        # Two convolutional blocks, each followed by MC dropout and max pooling
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=192, kernel_size=conv_size)
        self.dropmc1 = DropoutMC(drop_rate)
        self.pool1 = nn.MaxPool2d(kernel_size=pool_size, stride=2)
        self.conv2 = nn.Conv2d(in_channels=192, out_channels=192, kernel_size=conv_size, padding=2)
        self.dropmc2 = DropoutMC(drop_rate)
        self.pool2 = nn.MaxPool2d(kernel_size=pool_size, stride=2)
        # Fully connected classifier head with MC dropout before the output layer
        self.dense1 = nn.Linear(in_features=7*7*192, out_features=1000)
        self.dropmc3 = DropoutMC(drop_rate)
        self.dense2 = nn.Linear(in_features=1000, out_features=10)

    def forward(self, x):
        x = self.dropmc1(self.conv1(x))
        x = self.pool1(x)
        x = self.dropmc2(self.conv2(x))
        x = self.pool2(x)
        x = x.view(x.size(0), -1)      # flatten to (batch_size, 7*7*192)
        x = self.dropmc3(F.relu(self.dense1(x)))
        x = self.dense2(x)

        return x


2.1.2 MC dropout layer

Our specification of LeNet-5 differs from the original in one crucial way: we use MC dropout layers. MC dropout is not a feature implemented in PyTorch, so we must implement such a layer ourselves. Fortunately, this amounts to a simple adjustment of existing code. We modify the Dropout class to take an additional argument called dropoutMC with its default value set to True:

class DropoutMC(nn.Module):
    r"""
    Modified version of Dropout from torch/nn/modules/dropout.py
    Args:
        p: probability of an element to be zeroed. Default: 0.5
        dropoutMC: If set to ``True``, dropout is turned on at test time. Default: ``True``
        inplace: If set to ``True``, will do this operation in-place. Default: ``False``
    Shape:
        - Input: `Any`. Input can be of any shape
        - Output: `Same`. Output is of the same shape as input
    Examples::
        >>> m = nn.Dropout(p=0.2)
        >>> input = autograd.Variable(torch.randn(20, 16))
        >>> output = m(input)
    .. _Improving neural networks by preventing co-adaptation of feature
        detectors: https://arxiv.org/abs/1207.0580
    """

    def __init__(self, p=0.5, dropoutMC=True, inplace=False):
        super(DropoutMC, self).__init__()
        if p < 0 or p > 1:
            raise ValueError("dropout probability has to be between 0 and 1, "
                             "but got {}".format(p))
        self.p = p
        self.dropoutMC = dropoutMC
        self.inplace = inplace

    def forward(self, input):
        return F.dropout(input, self.p, self.dropoutMC, self.inplace)

    def __repr__(self):
        inplace_str = ', inplace' if self.inplace else ''
        return self.__class__.__name__ + '(' \
            + 'p=' + str(self.p) \
            + inplace_str + ')'


2.1.3 Inference

We need to define a function that performs inference over our input. The following chunk contains the relevant code for the sampling procedure described in section 1.1. The function inference stores all the relevant statistics and softmax distributions in a dictionary named output. The results are then turned into a pandas dataframe and some very basic feature engineering is performed (a sketch of this step follows the chunk). Finally, the data is prepared for statistical analysis in R.

def inference(learner, data, T=100):
    ''' Function that gathers all relevant numerical results from MC dropout over T iterations.
        
        Arguments:
        learner, fastai learner object
        data, fastai dataloader
        T, number of stochastic forward passes
    '''
    # Get images, labels and filenames
    imgs, labels = next(iter(data.val_dl))
    fnames = data.val_ds.fnames

    # Empty dictionary to store all output
    output = {}

    # Empty array to store results in (num_classes = 10 for CIFAR-10, defined elsewhere in the notebook)
    results = np.empty((T, num_classes))

    # iterator index to keep in dictionary
    k=0

    for (img, label, fname) in list(zip(imgs, labels, fnames)):

        for i in range(T):
            prediction = learner.predict_array(img[None])
            results[i] = prediction

        probs = to_np(F.softmax(V(results)))
        probs_mean = np.mean(probs, axis=0)
        pred_std = np.std(probs, axis=0)

        prediction = probs_mean.argmax()
        uncertainty = pred_std[prediction]

        correct = 1 if prediction == label else 0

        output[k] = {"img": fname, "softmax_dist": probs, "probs": probs_mean, "prediction": prediction, "truth": label, "uncertainty": uncertainty, "correct": correct}
        k+=1
    
    return output
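
The conversion to a pandas dataframe and the subsequent feature engineering are not shown in the notebook; the following is a rough sketch of what that step might look like, assuming output is the dictionary returned by inference (the column names match the variables described in section 2.3, and the file name is illustrative):

import pandas as pd

# output: dictionary returned by inference(), keyed by image index
df = pd.DataFrame.from_dict(output, orient="index")

# Basic feature engineering: top-1 probability, runner-up class/probability and
# the derived quantities used in the analysis below.
df["prob1"] = df["probs"].map(lambda p: p.max())
df["class2"] = df["probs"].map(lambda p: p.argsort()[-2])
df["prob2"] = df["probs"].map(lambda p: sorted(p)[-2])
df["diff"] = df["prob1"] - df["prob2"]
df["diff_sd_ratio"] = df["diff"] / df["uncertainty"]

# Drop the array-valued columns and write the rest to csv for the analysis in R.
df.drop(columns=["softmax_dist", "probs"]).to_csv("lenet-model55.csv")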

2.2 Training

Our model was trained on CIFAR-10 using the fastai API. CIFAR-10 contains 60,000 labelled 32x32x3 color images belonging to 10 different classes. The input data was split into a training set of 50,000 images and a test set of 10,000 images. The training set was further split into a training set and a validation set.

The network was trained for 60 epochs with a learning rate of \(0.001\) (found using the lr_find functionality in fastai), kernel size = (5,5), drop_rate = .5 and weight_decay = 0.0005. It is identical in structure to the one used by Gal et al. [1]. The validation loss was 0.71280 at the end of training, with an accuracy of 0.76137 on the validation data.
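
The training itself was run through the fastai API and is not reproduced here. As a rough, generic PyTorch sketch of an equivalent training loop with the hyperparameters above (not the actual code used; the optimizer choice and data-loading details are assumptions):

import torch
import torch.nn as nn

# model: an instance of lenet_all as defined in section 2.1.1
# train_loader: a DataLoader yielding (images, labels) batches from CIFAR-10
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.0005)

for epoch in range(60):
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()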

2.3 Data

The data contains the following variables after it has been prepared for analysis in R:

  • correct (logical): indicator that is TRUE if the predicted class label matches the true class label, else FALSE.

  • prediction (int): predicted class label.

  • truth (int): true class label.

  • uncertainty (dbl): empirical standard deviation of softmax values for predicted class.

  • prob1 (dbl): maximum of the mean softmax output, i.e. the mean probability of the predicted class.

  • prob2 (dbl): mean probability of runner-up prediction.

  • class2 (int): class label of runner-up prediction.

  • diff (dbl): prob1-prob2

  • diff_sd_ratio (dbl): diff/uncertainty.

All the variables above are pretty standard, with the exception of diff_sd_ratio. Intuitively, if diff is large, the averaged models all agree that class \(k\) is the correct prediction. If diff is small, the models sampled by MC dropout don’t agree on a single class. Thus diff also serves as a proxy for uncertainty. Model uncertainty, however, is approximated by the empirical standard deviation of the predictions for class \(k\). Thus diff_sd_ratio is expressed by \[\tau_{kj} = \frac{\hat{\mu}_k - \hat{\mu}_j}{\hat{\sigma}_k}\] where \(j\) is the runner-up prediction. \(\tau_{kj}\) gives us a ratio of two different measures of uncertainty.
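
As a small worked example (with made-up numbers), suppose the predicted class has \(\hat{\mu}_k = 0.55\) and \(\hat{\sigma}_k = 0.05\), and the runner-up has \(\hat{\mu}_j = 0.30\). Then \[\tau_{kj} = \frac{0.55 - 0.30}{0.05} = 5,\] i.e. the gap between the two leading classes is five standard deviations wide and the prediction looks clear-cut. With the same gap but \(\hat{\sigma}_k = 0.30\), we get \(\tau_{kj} \approx 0.83\), and the same nominal gap now looks much less convincing.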

3 Classification Results

The following table summarises the results over the entire data set:

# Importing data
data <- read.csv("~/Documents/Masteroppgave/Data/Resultater/lenet-model55.csv")
df <- dplyr::select(data, -X)
# Summary statistics over the entire data set
results <- df %>%
  summarise(n=n(),
            accuracy = sum(correct==1)/n,
            mean_prob1=mean(prob1),
            mean_prob2=mean(prob2),
          mean_uncertainty=mean(uncertainty),
          median_uncertainty=median(uncertainty),
          sd_uncertainty=sd(uncertainty), 
          mean_diff=mean(diff),
          median_diff=median(diff),
          sd_diff=sd(diff),
          mean_ratio=mean(diff_sd_ratio),
          median_ratio=median(diff_sd_ratio),
          sd_ratio=sd(diff_sd_ratio))
results

The next table shows the summary statistics after partitioning the data by incorrect/correct classifications:

# Aggregating summary statistics by correct/incorrect
agg_df <- df %>% 
  group_by(correct) %>% 
  summarise(n=n(),
            accuracy=n/10000,
            mean_prob1=mean(prob1),
            mean_prob2=mean(prob2),
          mean_uncertainty=mean(uncertainty),
          median_uncertainty=median(uncertainty),
          sd_uncertainty=sd(uncertainty), 
          mean_diff=mean(diff),
          median_diff=median(diff),
          sd_diff=sd(diff),
          mean_ratio=mean(diff_sd_ratio),
          median_ratio=median(diff_sd_ratio),
          sd_ratio=sd(diff_sd_ratio))
agg_df

Overall our model classifies 7916 images correctly out of \(N=10,000\), giving it an accuracy of 79.16%. In brief, the summary statistics indicate that the softmax outputs of the predicted and runner-up classes tend to be closer to each other when the model misclassifies. Furthermore, incorrectly classified images seem to be associated with greater uncertainty on average. This result is of crucial importance and is a necessary condition for exploring how the uncertainty approximation correlates with model predictions.

The model is most successful when classifying images of automobiles (902 correct) and frogs (891 correct). The least successful predictions occur in the cat category, with 565 correct classifications. This is summarized in the following confusion matrix:
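
[Figure: confusion matrix of true versus predicted class labels (Figures/conf_mat_dat259.jpg).]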

The confusion matrix above provides details of exactly how the model fails. In images of cats the most frequent misclassification is dog (145). For dogs the most frequent incorrect label is cat (135). In general it seems that the misclassifications, although wrong, belong to the same domain as the correct label (i.e. one vehicle type is mistaken for another and one species of animal is misclassified as another). This suggests that our model is unable to learn distinguishing features which adequately separate similar classes. However, one should be cautious about reading too much into the type of errors being made. The results may very well be an artefact of the model, the data set, or both.

3.1 Examples of Uncertain Classifications

In the following we will visualize the relationships between our variables. We start by looking at examples of certain and uncertain images. Then we examine the empirical distribution of the uncertainty estimates \(\hat{\sigma}_k\).

3.1.1 Incorrect classifications

The following plots were generated using Python (code available on GitHub). On the left hand side we see the unnormalized image with the corresponding ground truth label. The plot in the middle shows the softmax output of the predicted class for each of the \(T=100\) stochastic forward passes; \(\hat{\mu}_k\) is given by the solid red line, \(\hat{\mu}_j\) by the dashed brown line. The plot title shows both the predicted class and the runner-up class. To the right is a kernel density estimate (using a Gaussian kernel) of the \(T=100\) softmax outputs for the predicted class. The plots show the top 5 most uncertain classifications in the entire data set.

3.1.2 Correct classifications

The runner-up predictions for the five most uncertain but correctly classified images are all “airplane”. One possible explanation is the presence of large areas of blue and/or white, which we might expect as backgrounds in images of planes flying through the sky.

3.2 Distribution of Uncertainty

The distribution of total uncertainty appears to be bimodal, with peaks close to 0 and 0.2:

# Distribution of estimated uncertainty
p1 <- df %>% 
  ggplot(aes(x=uncertainty)) +
  geom_histogram(col="grey", bins = 50, alpha=.5) +
  ggtitle("Distribution of estimated uncertainty") +
  theme_bw()
p1

By grouping the uncertainty estimates by correct (i.e. whether the label was correctly predicted or not), we can find out how the predictions contribute to the uncertainty distribution:

#Distribution of estimated uncertainty by prediction
p2 <- df %>% 
  ggplot(aes(x=uncertainty, col=as.factor(correct))) +
  geom_freqpoly(alpha=.7) +
  ggtitle("Distribution of estimated uncertainty by classification") +
  scale_color_discrete(name="Prediction",
                         breaks=c("0", "1"),
                         labels=c("0: Incorrect", "1: Correct")) +
  theme_bw()
p2

The blue line corresponds to the correct predictions, the red line corresponds to incorrect predictions. We see that the incorrect predictions are centered around a higher associated uncertainty, whereas far more of the correctly predicted classes are concentrated around a low uncertainty value. The incorrect classifications greatly contribute to the bimodality, but it is also present in the distribution of uncertainty for the correct classifications.

The following density plots give us an idea of how the distributions compare to each other:

# KDE by correct prediction
lines_df <- df %>% 
  group_by(correct) %>% 
  summarise(mean_uncertainty=mean(uncertainty))
p3 <- df %>% 
  ggplot(aes(x=uncertainty, group=factor(correct))) +
  geom_histogram(aes(y=..density.., fill=factor(correct)), position="identity", bins = 50, alpha=.5) +
  geom_density(aes(y=..density.., col=factor(correct)), alpha=.5) +
  geom_rug(aes(x=uncertainty, col=factor(correct)), alpha=.5) +
  geom_vline(data = lines_df, aes(xintercept=mean_uncertainty, col=factor(correct)), lty="dashed") +
  ggtitle("Distributions of uncertainty by classification") +
  labs(color="Prediction",
       fill="Prediction") +
  scale_color_discrete(breaks=c("0", "1"),
                         labels=c("0: Incorrect", "1: Correct")) +
  scale_fill_discrete(breaks=c("0", "1"),
                         labels=c("0: Incorrect", "1: Correct")) +
  theme_bw() +
  theme(legend.position = c(.9,.85))
p3

The boxplot gives us yet another way to visualize the difference between uncertainty distributions by predictive ability:

# Boxplot of uncertainties for correct vs. incorrect
p4 <- df %>% 
  ggplot(aes(x=as.factor(correct), y=uncertainty)) +
  geom_boxplot(aes(fill=as.factor(correct)), alpha=.7) +
  labs(x="correct") +
  ggtitle("Boxplot of uncertainty distribution by correct/incorrect") +
  scale_fill_discrete(name="Prediction",
                         breaks=c("0", "1"),
                         labels=c("0: Incorrect", "1: Correct")) + 
  theme_bw()
p4

If \(\hat{\sigma}_k\) is a useful approximation of model uncertainty, then we would expect the distributions of \(\hat{\sigma}_k\) for correctly and incorrectly classified images to reflect this somehow. From the figures above, it seems clear that the two distributions of uncertainty are meaningfully different from one another, at least for this particular choice of model and data set. This indicates that the uncertainty approximation itself may contain valuable information about the model’s confidence in its own predictions.

3.3 Class-specific uncertainty

If \(\hat{\sigma}_k\) captures model uncertainty, then we would expect the class-specific distributions of \(\hat{\sigma}_k\) to somehow echo the confusion matrix in section 3. Let \(\hat{\sigma}_j\) denote the class-specific uncertainty, where \(j \in \{\text{airplane},...,\text{truck}\}\) is the corresponding class label.

to_string <- as_labeller(c(
  "0" = "airplane", 
  "1" = "automobile", 
  "2" = "bird",
  "3" = "cat", 
  "4" = "deer", 
  "5" = "dog", 
  "6" = "frog", 
  "7" = "horse", 
  "8" = "ship", 
  "9" = "truck"))
unc_df <- df %>% 
  group_by(truth) %>% 
  summarise(mean_uncertainty = mean(uncertainty))
mean_df <- as.tibble(rep(mean(df$uncertainty), 10))
hist_class <- df %>% 
  ggplot(aes(x=uncertainty)) +
  geom_histogram(aes(y=..density.., fill=factor(truth)), alpha=.7, bins=50) +
  geom_vline(data=unc_df, aes(xintercept = mean_uncertainty, col=factor(truth)), lwd=.5, lty="dashed") +
  geom_vline(data=mean_df, aes(xintercept = value), lwd=.3, lty="dotted") +
  facet_wrap(~truth, nrow = 2, ncol = 5, drop=FALSE, labeller = to_string) +
  theme_bw() +
  theme(legend.position = "none") +
  ggtitle("Distributions of class uncertainty")
hist_class

As noted in section 3, the model clearly mistakes some cats for dogs and some dogs for cats. The confusion seems to be reflected in the relative distributions of \(\hat{\sigma}_{cat}\) and \(\hat{\sigma}_{dog}\).

The following boxplots give an idea of the spread of \(\hat{\sigma}_j\) for each class \(j\) after grouping by classification status:

to_string <- as_labeller(c(
  "0" = "airplane", 
  "1" = "automobile", 
  "2" = "bird",
  "3" = "cat", 
  "4" = "deer", 
  "5" = "dog", 
  "6" = "frog", 
  "7" = "horse", 
  "8" = "ship", 
  "9" = "truck"))
unc_df <- df %>% 
  group_by(truth) %>% 
  summarise(mean_uncertainty = mean(uncertainty))
box_class <- df %>% 
  ggplot(aes(y=uncertainty)) +
  geom_boxplot(aes(x=factor(correct), fill=factor(correct)), alpha=.7) +
  facet_wrap(~truth, nrow=2, ncol=5, labeller = to_string) +
  theme_bw() +
  theme(legend.position = "none") +
  ggtitle("Boxplots of class uncertainty") +
  labs(x="correct")
box_class

In summary, the class-specific approximations of uncertainty are consistent with the confusion matrix in section 3: higher uncertainty seems to be associated with categories that tend to be mislabelled.

3.4 Relationship to other variables

As \(\hat{\mu}_k\) decreases, the softmax probabilities are spread out across the elements of the predictive mean vector \(\hat{\mu}\). Another view is that the number of candidate classes increases. Naively, we would therefore expect lower values of \(\hat{\mu}_k\) to be associated with higher uncertainty.

The following figure shows the relationship between \(\hat{\sigma}_k\) and \(\hat{\mu}_k\) (we have plotted a smoothed estimate of the mean uncertainty as a function of the prob1 output to make this clearer):

# Plotting softmax output of prediction against estimated uncertainty
p5 <- df %>% 
  ggplot(aes(x=prob1, y=uncertainty)) +
  geom_point(alpha=.2) +
  geom_smooth() +
  ggtitle("Uncertainty vs. softmax prediction") +
  labs(x="softmax output of predicted class") +
  theme_bw()
p5

The plot indicates a concave shape, which contradicts our naive assumption. A possible explanation is that the minimum value of \(\hat{\mu}_k\) is bounded below by \(0.1\): since the elements of \(\hat{\mu}\) sum to one and \(\hat{\mu}_k = \max_j \hat{\mu}_j\), as \(\hat{\mu}_k \rightarrow 0.1\) all the other elements \(\hat{\mu}_j \rightarrow 0.1\) as well. Pursuing this line of thought, we speculate that when \(\hat{\mu}_k\) is low the model most likely yields similar (near-uniform) softmax outputs for each stochastic forward pass, which leads to a lower estimate of \(\hat{\sigma}_k\) and explains why the uncertainty decreases. This highlights a potential drawback of the ad hoc approximation given in section 1.1: a classification that results in maximum uncertainty with respect to a measure such as entropy (i.e. when \(\hat{\mu}_j = 0.1\) for all \(j\)) could potentially be associated with a small empirical standard deviation \(\hat{\sigma}_k\).
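
To illustrate this drawback, here is a toy numpy example with made-up softmax samples (not data from the experiment):

import numpy as np

# Case 1: every forward pass returns the uniform distribution over 10 classes.
uniform_runs = np.full((100, 10), 0.1)
# Case 2: the passes disagree, oscillating between two confident answers.
split_runs = np.tile([[0.9] + [0.1 / 9] * 9,
                      [0.1 / 9] * 9 + [0.9]], (50, 1))

for runs in (uniform_runs, split_runs):
    mu = runs.mean(axis=0)                 # predictive mean vector
    k = mu.argmax()                        # predicted class
    sigma_k = runs[:, k].std(ddof=1)       # ad hoc uncertainty estimate
    entropy = -(mu * np.log(mu)).sum()     # predictive entropy
    print(f"sigma_k = {sigma_k:.3f}, entropy = {entropy:.3f}")

The uniform case is maximally uncertain in terms of entropy, yet its \(\hat{\sigma}_k\) is exactly zero, while the split case has a large \(\hat{\sigma}_k\) despite a lower predictive entropy.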

Furthermore, \(\hat{\sigma}_k\) seems to peak around prob1 \(\approx 0.5\). Additional information can be leveraged by colour-coding the data points based on the value of the runner-up prediction prob2, as shown in the following figure:

# Plotting softmax output of prediction against estimated uncertainty, coloured by softmax output of runner-up
p7 <- df %>% 
  ggplot(aes(x=prob1, y=uncertainty)) +
  geom_point(aes(col=prob2), shape=21, alpha=.7) +
  geom_point(data=agg_df, aes(x=mean_prob1, y=mean_uncertainty, shape=as.logical(correct)), size=2.5) +
  #geom_point(data=agg_df, aes(y=uncertainty, x=mean_prob1, shape=as.logical(correct)), size=2) +
  ggtitle("Uncertainty vs. softmax output of prediction for all observations") +
  labs(x="softmax output of predicted class", shape="correct") +
  scale_color_distiller(name="Runner up",
                        palette = "Spectral") +
  scale_shape_discrete(name=expression(paste("(",bar(mu)["pred"],", ",bar(sigma)["pred"],")")),
                       labels=c("Incorrect", "Correct")) +
  theme_bw()
p7

A blue point corresponds to an observation with a low value of prob2, whereas a red point corresponds to an observation with a high value of prob2. Additionally, we have plotted the group means of prob1 and uncertainty for correctly (black dot) and incorrectly (black triangle) classified observations.

Clearly, the largest values of prob2 cluster around prob1 \(\approx 0.5\). This is to be expected: since the softmax outputs sum to one, prob2 is bounded above by \(\min(\text{prob1}, 1 - \text{prob1})\), so the maximum amount of “left-over” probability available for distribution among the other classes in \(\hat{\mu}\) occurs at prob1 \(= 0.5\). In other words, the value of prob2 is constrained by prob1. However, the maximum values of \(\hat{\sigma}_k\) appear to be associated with relatively large pairs of prob1 and prob2. This could suggest that uncertainty estimation by ad hoc methods such as \(\hat{\sigma}_k\) may be most useful in binary classification settings, or for detecting situations with a small number of competing classes in a multi-class setting.

The observations can be partitioned by classification status and overlaid with a 2D kernel density estimate (KDE). The KDE gives an idea of the joint distribution of prob1 and the estimated uncertainty:

# Plotting uncertainty vs. softmax output of prediction
p8 <- df %>% 
  ggplot(aes(x=prob1, y=uncertainty)) +
  geom_point(aes(col=prob2),alpha=0.5) +
  geom_density_2d(col="black", alpha=.3) +
  ggtitle("Uncertainty vs. softmax output of prediction by incorrect/correct") +
  labs(x="softmax output of prediction") +
  facet_grid(.~as.factor(correct)) +
  scale_color_distiller(name="Runner up",
                        palette = "Spectral") +
  theme_bw()
p8

The black contour lines indicate where most of the points are concentrated. The plot on the left is for incorrect predictions. The right hand plot represents the correct predictions. For the correct predictions, it seems as if far more of the points are concentrated around high predicted output/low runner-up output/low uncertainty. For the incorrect classifications, most of the points are concentrated around the area where \(\hat{\sigma}_k\) peaks. This indicates that the approximated uncertainty estimates indeed contain valuable information in the incorrect cases.

3.4.1 Runner-up predictions

The following plot shows the relationship between uncertainty estimates and the softmax output of the runner-up prediction. Unsurprisingly, model uncertainty increases as the softmax output of the runner-up increases. We have plotted a smoothed estimate of the mean uncertainty as a function of the runner-up output to make this clearer:

# Plotting softmax output of runner-up against estimated uncertainty
p10 <- df %>% 
  ggplot(aes(x=prob2, y=uncertainty)) +
  geom_point(alpha=.2) +
  geom_smooth(method="loess") +
  labs(x="softmax output of runner-up") +
  ggtitle("Uncertainty vs. softmax output of runner-up") +
  theme_bw()
p10

3.5 Uncertainty-prediction correlation

As mentioned in section 1.2, the connection established between dropout neural networks and GPs is lost when the method is applied to convolutional neural networks. Performing a logistic regression gives us a simple way of testing whether the approximated uncertainty is a significant predictor of the model’s ability to predict correctly:

# Fitting logistic regression model to check significance of uncertainty
model_sd <- glm(as.factor(correct)~uncertainty, data=df, family = binomial(link="logit"))
summary(model_sd)

Call:
glm(formula = as.factor(correct) ~ uncertainty, family = binomial(link = "logit"), 
    data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6704   0.2409   0.3588   0.7100   2.0114  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   3.55249    0.07625   46.59   <2e-16 ***
uncertainty -15.40803    0.43250  -35.63   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 10236.6  on 9999  degrees of freedom
Residual deviance:  8467.6  on 9998  degrees of freedom
AIC: 8471.6

Number of Fisher Scoring iterations: 5

The coefficient associated with uncertainty is highly significant, indicating that the estimates leveraged from MC dropout may indeed be a useful quantification of predictive uncertainty.

4 Referral Criteria

The question is then: how do we determine a reasonable uncertainty value for referral to a human expert? As we saw in section 3, the mean uncertainty of the incorrect predictions is higher than that of the correct predictions (0.1802444 vs. 0.1021673, respectively).

Naively, we could set the threshold for referral to the mean uncertainty of the incorrectly classified images:

# Counting number of correct/incorrect by uncertainty >= .18 
mean_unc <- df %>% 
  filter(correct==0) %>% 
  summarise(mean_unc=mean(uncertainty))
referral <- df %>% 
  filter(uncertainty>=mean_unc[1,1]) %>% 
  count(correct)
referral

The problem here is apparent: many of the correctly classified images are associated with a relatively high level of uncertainty (in our case, this is due to the bimodality of the uncertainty distribution in section 3.2). We speculate that although the uncertainty seems to contain valuable information, the difference between the uncertainties of the correct and incorrect observations may (in this case) be too small to differentiate which images should be referred to an expert. We may need to amplify the quantification of uncertainty in some way.

4.1 Usefulness of runner-up predictions

For the most uncertain incorrectly classified images the runner-up suggestions are nonsensical. Still, is there any information to be obtained from the runner-up predictions? What would happen to our overall accuracy if we used the runner-up predictions for all incorrect classifications?

# Counting classification accuracy if runner-up is equal to ground truth
class2_df <- df %>%
  mutate(correct=replace(correct, correct==0 & class2==truth, 1)) %>% 
  count(correct)
class2_df

Accuracy would rise to 0.9079. This indicates that there may be some valuable information to be gathered from the runner-up predictions.

runnerup_df <- df %>% 
  filter(correct==0) %>% 
  mutate(runnerup = ifelse(class2==truth, 1, 0))
mean_runnerup <- runnerup_df %>%
  group_by(runnerup) %>% 
  summarise(meanuncertainty = mean(uncertainty),
            meanprob1 = mean(prob1),
            meanprob2 = mean(prob2),
            n=n())
p11 <- runnerup_df %>% 
  ggplot(aes(x=uncertainty, group=factor(runnerup))) +
  geom_histogram(aes(y=..density.., fill=factor(runnerup)), position="identity", bins = 50, alpha=.5) +
  geom_density(aes(y=..density.., col=factor(runnerup)), alpha=.5) +
  geom_rug(aes(x=uncertainty, col=factor(runnerup)), alpha=.5) +
  geom_vline(data = mean_runnerup, aes(xintercept=meanuncertainty, col=factor(runnerup)), lty="dashed") +
  ggtitle("Distributions of uncertainty") +
  labs(color="Prediction",
       fill="Prediction") +
  scale_color_discrete(breaks=c("0", "1"),
                         labels=c("0: Incorrect", "1: Correct")) +
  scale_fill_discrete(breaks=c("0", "1"),
                         labels=c("0: Incorrect", "1: Correct")) +
  theme_bw() +
  theme(legend.position = c(.9,.85))
p11

p12 <- runnerup_df %>% 
  ggplot(aes(x=prob1, group=factor(runnerup))) +
  geom_density(aes(y=..density.., col=factor(runnerup)), alpha=.5) +
  geom_rug(aes(x=prob1, col=factor(runnerup)), alpha=.5) +
  geom_vline(data = mean_runnerup, aes(xintercept=meanprob1, col=factor(runnerup)), lty="dashed") +
  ggtitle("Distributions of softmax predictions") +
  labs(color="Prediction",
       fill="Prediction") +
  scale_color_discrete(breaks=c("0", "1"),
                         labels=c("0: Incorrect", "1: Correct")) +
  scale_fill_discrete(breaks=c("0", "1"),
                         labels=c("0: Incorrect", "1: Correct")) +
  theme_bw() +
  theme(legend.position = c(.9,.85))
p12

p13 <- runnerup_df %>% 
  ggplot(aes(x=prob2, group=factor(runnerup))) +
  geom_density(aes(y=..density.., col=factor(runnerup)), alpha=.5) +
  geom_rug(aes(x=prob2, col=factor(runnerup)), alpha=.5) +
  geom_vline(data = mean_runnerup, aes(xintercept=meanprob2, col=factor(runnerup)), lty="dashed") +
  ggtitle("Distributions of runner up softmax probabilities") +
  labs(color="Prediction",
       fill="Prediction") +
  scale_color_discrete(breaks=c("0", "1"),
                         labels=c("0: Incorrect", "1: Correct")) +
  scale_fill_discrete(breaks=c("0", "1"),
                         labels=c("0: Incorrect", "1: Correct")) +
  theme_bw() +
  theme(legend.position = c(.9,.85))
p13

So how do we determine which images to take a closer look at?

4.2 Leveraging Runner-up Probabilities

One possible approach is to incorporate information from the value of \(\tau_{kj}\), which we introduced in section 2.3. Consider the following cases:

  • \(\tau_{kj} \approx 1\) means that diff and uncertainty are relatively similar. This happens if a) the models have failed to agree on a clear favourite (diff is small) but model uncertainty is low, or b) the models have a favourite (diff is large) but model uncertainty is high. Let’s call these referral predictions.

  • \(\tau_{kj} \to 0\) means that uncertainty is much larger than diff. These should represent uncertain predictions.

  • \(\tau_{kj} \to \infty\) means that uncertainty is much smaller than diff. These should represent non-referral predictions.

As we saw in section 3, the mean \(\tau_{kj}\) (given by mean_ratio) is 2.8112788 for the incorrectly classified images and 262.6516078 for the correctly classified images. Setting the referral threshold to the mean \(\tau_{kj}\) for the incorrect classifications (2.8112788), we get:

# Counting number of correct/incorrect by tau <= 2.8
mean_tau <- df %>% 
  filter(correct==0) %>% 
  summarise(mean_tau=mean(diff_sd_ratio))
  
referral_mean <- df %>% 
  filter(diff_sd_ratio <= mean_tau[1,1]) %>% 
  count(correct)
referral_mean

We run into the same problem as when we used the mean uncertainty: many of the correctly predicted images would be referred, in fact more than when we used the uncertainty estimates. It is interesting to note, however, that we also get far more referrals of incorrectly classified images. Again, this may indicate that \(\tau_{kj}\) is a useful measure of uncertainty.

Setting the referral threshold to \(\tau_{kj} \leq 1\) gives us the following:

# Counting number of correct/incorrect by tau <= 1
referral_1 <- df %>% 
  filter(diff_sd_ratio <= 1) %>% 
  count(correct)
referral_1

This is a clear improvement over our first approach in terms of the number of referred images: 1315 fewer correctly classified images are referred, at the cost of referring 731 fewer incorrectly classified ones. This could indicate that \(\tau_{kj}\) is a useful quantification of uncertainty.

5 References

[1] Gal, Y. & Ghahramani, Z. Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference. arXiv: 1506.02158 (2015).

[2] Gal, Y. & Ghahramani, Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. arXiv: 1506.02142 (2015).

[3] Gal, Y. & Ghahramani, Z. Dropout as a Bayesian Approximation: Appendix. arXiv: 1506.02157 (2015).

[4] Leibig, C., Allken, V., et al. Leveraging uncertainty information from deep neural networks for disease detection. Scientific Reports, volume 7, Article number: 17816 (2017).

[5] Hinton, G., Srivastava, N., et al. Improving neural networks by preventing co-adaptation of feature detectors. arXiv: 1207.0580 (2012).

[6] Leibig, C., Allken, V., et al. Leveraging uncertainty information from deep neural networks for disease detection: supplementary material. Scientific Reports, volume 7, Article number: 17816 (2017).

[7] Smith, L. & Gal, Y. Understanding Measures of Uncertainty for Adversarial Example Detection. arXiv: 1803.08533 (2018).

---
title: "DAT259: An Exploratory Data Analysis of Approximated Uncertainty in Deep Learning"
author: "Sean Meling Murray"
output:
  html_document:
    df_print: paged
    toc: yes
  html_notebook:
    number_sections: yes
    toc: yes
    code_folding: hide
---

***
# Introduction

The success of deep learning models has typically been measured in terms of their predictive power, but they have lacked a principled way of expressing uncertainty about these predictions.

In my master's thesis we apply recent insights connecting dropout neural networks to [Bayesian approximate variational inference (VI)](https://arxiv.org/abs/1601.00670). VI is technique for approximating intractable integrals that arise when modelling the posterior distribution in Bayesian inference and machine learning. Gal et. al. [1, 2, 3] have shown that the posterior distribution of a NN can be approximated by leaving dropout on at test time and sampling multiple predictions. This amounts to drawing Monte Carlo (MC) samples from the posterior distribution (a brief description of this process is listed in section 1.1). The Bayesian framework allows us to obtain principled uncertainty estimates when making predictions in these so called *MC dropout* NNs. 

The results of [1,2,3] seem particularly promising when applied to healthcare. In fact, a recent paper published in Nature [4] applies MC dropout to capture uncertainty estimates when predicting the presence of diabetic retinopathy in patients. This paper demonstrates how uncertainty estimates can provide useful information to the clinician tasked with interpreting the results of medical images. The key idea is that this human-machine interaction will lead to overall better results than either could produce individually.

In this notebook I will first briefly introduce the general idea behind MC dropout. Then we will demonstrate how MC dropout has been applied in existing literature through a practical example. We will use the `fastai` framework, and the goal is to estimate uncertainty on a classification task based on the collection of labelled images in the [CIFAR-10 data set](https://www.cs.toronto.edu/~kriz/cifar.html).

## Background

A neural network is made up of many neurons which are organized in layers. Neurons are often called nodes or units, depending on your choice of machine learning literature. Henceforth we will refer to them as units. 

In a typical neural network there are many, many units. As a consequence the number of parameters greatly exceeds the number of data points, dramatically increasing the risk of overfitting (i.e. there are many settings of weights which will cause the network to fit the data almost perfectly).

Dropout is a common stochastic regularisation technique [5] that is used to prevent overfitting in neural networks. The term dropout refers to randomly "dropping out" nonoutput units, temporarily removing all connections to the rest of the network. The main idea is that if the presence of other units is unreliable, each unit is forced to learn how to be useful on its own. At test time all units are activated, and the weights of the network are scaled by the dropout rate in order to match the expected output during training.

Recent work by Gal et. al. [1, 2, 3] casts dropout neural networks as approximate Bayesian inference. Their results show that the predictive posterior distribution of a neural network can be approximated by leaving dropout on at test time. 

Consider a classification setting, such as in CIFAR-10. Essentially, what happens when applying MC dropout is the following:

* An image is fed forward through the network $T$ times (we use $T=100$ as recommended in [4]). Each time the image is fed through is called a *stochastic forward pass*.
* For each stochastic forward pass, a slightly different network is making predictions because dropout has randomly switched off units. 
* As a result each stochastic forward pass returns 100 slightly different vectors of class predictions.
* To make a prediction we average the 100 vectors. The is called the **predictive mean** vector, denoted by $\hat{\mu}$. The class corresponding to the largest element in the resulting vector is our final prediction, i.e. $p(y=k|X,w) = \text{argmax}_k \hat{\mu}$.
* Finally we calculate the standard deviation in class predictions over all forward passes. This is our estimated uncertainty.

Gal et. al. [3] show that the above amounts to drawing Monte Carlo samples from the predictive posterior. Their work demonstrates that applying dropout is effectively the same as defining a *Bayesian neural network* with a Bernoulli approximating prior over the parameters. [Gal has written an excellent blog post that introduces the work](http://mlg.eng.cam.ac.uk/yarin/blog_3d801aa532c1ce.html) and the derivation.

In practical applications [4], model uncertainy is approximated by the empirical standard deviation of the predictions for class $k$, i.e. $$\hat{\sigma}_k = \sqrt{\frac{\sum_{t=1}^T{[p_t(k|X,w) - \hat{\mu}_k}]^2}{T-1}}$$ where $$\hat{\mu}_k = \frac{1}{T}\sum_{t=1}^T p_t(k|X,w)$$ is the averaged softmax outputs of the predicted class. This is a somewhat ad hoc approximation, but recent research suggests that $\hat{\sigma}_k$ is a meaningful uncertainty estimate [7].

## Problem statement

The derivation in Gal et. al. [2] is based on the observation that a shallow neural network with infinitely many weights converges to a Gaussian process (GP) with a specific covariance function. A GP is a non-parametric, probabilistic machine learning method. The idea is that we place a prior distribution over the function space, and by observing new data points we can figure out which function is most likely to have generated the observed data. A GP is fully specified by its mean and covariance functions, and allows us to obtain uncertainty estimates [1]. The extension of this to dropout neural networks is the main result of Gal et. al. [2].

In [1] however, Gal et. al. extend this idea further to include priors over the weights of the convolutional layers in a convolutional neural network (CNN). A CNN is a specialized type of neural network, used primarily and very effectively for image analysis. The authors briefly state that the GP interpretation is lost when the model is extended to CNNs, but that these networks still can be "modelled as Bayesian".

As far as we are aware, there have been no efforts to determine the correlation between the approximated uncertainty and a network's ability to predict correctly. Moreover, the work done so far seems to focus on the uncertainty associated with the predicted class. In other words, our problem statement is:

> *Are the uncertainty approximations obtained by applying Monte Carlo dropout to convolutional neural networks a reasonable measure of model uncertainty?*

We will also examine if there is any additional information to be gained from establishing a connection between the prediction and the *runner-up* prediction.

# Setup
In this notebook we use the following R packages: `tidyverse` and `ggplot2`. In addition we use the `multiplot` functionality provided by [Cookbook for R](http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/). In Python we use `fastai`, `numpy` and `pandas`. Full code for training and inference [available at GitHub](link). All code for the exploratory data analysis is included in this document.

```{r echo=FALSE, results='hide'}
# Loading useful packages
library(tidyverse)
library(ggplot2)
library(caret)

# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

 if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}

# Data cleaning function
process <- function(ROOT, filename, name, kernel_sz, drop_rate){
  ROOT <- "~/Documents/Masteroppgave/Data/Resultater/"
  PATH <- paste(ROOT, filename, sep="")
  dat <- as.tibble(read.csv(as.character(PATH)))
  #dat <- select(dat, -X)
  dat <- dat %>% 
    mutate(model=name,
           kernel=kernel_sz,
           dropout=drop_rate)
  return(dat)
}
```

## Experimental setup{.tabset}

State-of-the-art architectures such as ResNet and DenseNet are very powerful, but they are also complicated and their inner workings are quite convoluted. We are primarily interested in examining the correlation between uncertainty estimation and predictive capabilities. It is arguably better to use a simple network architecture to illustrate the idea of MC dropout. In our approach we use LeNet-5. Click on the tabs for implementation details. Full code is available on [GitHub](link). In the following we assume basic familiarity with NNs and specifically CNNs. **NOTE: Click on Code-tab to expand chunks.**

### Model

LeNet-5 was a pioneering 7-layer convolutional neural network originally developed by Yann LeCunn in 1998 for handwritten digit recognition. It is hopelessly primitive compared to contemporary architectures, but is very useful for experimentation and proof-of-concept work due to it's simplicity. The following chunk shows the model architecture:

```{python python.reticulate=FALSE, eval=FALSE}
class lenet_all(nn.Module):
    def __init__(self, conv_size=conv_size, pool_size=2, drop_rate=p):
        super().__init__()
        self.drop_rate = drop_rate
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=192, kernel_size=conv_size)
        self.dropmc1 = DropoutMC(p)
        self.pool1 = nn.MaxPool2d(kernel_size=pool_size, stride=2)
        self.conv2 = nn.Conv2d(in_channels=192, out_channels=192, kernel_size=conv_size, padding=2)
        self.dropmc2 = DropoutMC(p)
        self.pool2 = nn.MaxPool2d(kernel_size=pool_size, stride=2)
        self.dense1 = nn.Linear(in_features=7*7*192, out_features=1000)
        self.dropmc3 = DropoutMC(p)
        self.dense2 = nn.Linear(in_features=1000, out_features=10)
        
    def forward(self, x):
        x = self.dropmc1(self.conv1(x))
        x = self.pool1(x)
        x = self.dropmc2(self.conv2(x))
        x = self.pool2(x)
        x = x.view(x.size(0), -1)
        x = self.dropmc3(F.relu(self.dense1(x)))       
        x = self.dense2(x)
        
        return x
```
<br>

### MC dropout layer

Our specification of LeNet-5 differs from the orginial in one crucial way: We use MC dropout layers. MC dropout is not a feature that is implemented in PyTorch, and we must therefore implemenet such a layer ourselves. Fortunately, this amounts to a simple adjustment of existing code. We modify the `Dropout` class to take an additional argument called `dropoutMC` with default value set to `True`:

```{python python.reticulate=FALSE, eval=FALSE}
class DropoutMC(nn.Module):
    r"""
    Modified version of Dropout from torch/nn/modules/dropout.py
    Args:
        p: probability of an element to be zeroed. Default: 0.5
        dropoutMC: If set to ``True``, dropout is turned on at test time. Default: ``True`
        inplace: If set to ``True``, will do this operation in-place. Default: ``False``
    Shape:
        - Input: `Any`. Input can be of any shape
        - Output: `Same`. Output is of the same shape as input
    Examples::
        >>> m = nn.Dropout(p=0.2)
        >>> input = autograd.Variable(torch.randn(20, 16))
        >>> output = m(input)
    .. _Improving neural networks by preventing co-adaptation of feature
        detectors: https://arxiv.org/abs/1207.0580
    """

    def __init__(self, p=0.5, dropoutMC=True, inplace=False):
        super(DropoutMC, self).__init__()
        if p < 0 or p > 1:
            raise ValueError("dropout probability has to be between 0 and 1, "
                             "but got {}".format(p))
        self.p = p
        self.dropoutMC = dropoutMC
        self.inplace = inplace

    def forward(self, input):
        return F.dropout(input, self.p, self.dropoutMC, self.inplace)

    def __repr__(self):
        inplace_str = ', inplace' if self.inplace else ''
        return self.__class__.__name__ + '(' \
            + 'p=' + str(self.p) \
            + inplace_str + ')'
```
<br>

### Inference

We need to define a function that performs inference over out input. The following chunk contains the relevant code for the sampling procedure described in section 1.1. The function `inference` stores all the relevant statistics and softmax distributions in a dictionary named `output`. The results are then turned into a `pandas` dataframe and some very basic feature engineering is performed. Finally, the data is prepared for statistical analysis in R.

```{python python.reticulate=FALSE, eval=FALSE}
def inference(learner, data, T=100):
    ''' Function that gathers all relevant numerical results from MC dropout over T iterations.
        
        Arguments:
        learner, fastai learner object
        data, fastai dataloader
        T, number of stochastic forward passes
    '''
    # Get images, labels and filenames
    imgs, labels = next(iter(data.val_dl))
    fnames = data.val_ds.fnames

    # Empty dictionary to store all output
    output = {}

    # Empty array to store results in
    results = np.empty((T, num_classes))

    # iterator index to keep in dictionary
    k=0

    for (img, label, fname) in list(zip(imgs, labels, fnames)):

        for i in range(T):
            prediction = learner.predict_array(img[None])
            results[i] = prediction

        probs = to_np(F.softmax(V(results)))
        probs_mean = np.mean(probs, axis=0)
        pred_std = np.std(probs, axis=0)

        prediction = probs_mean.argmax()
        uncertainty = pred_std[prediction]

        correct = 1 if prediction == label else 0

        output[k] = {"img": fname, "softmax_dist": probs, "probs": probs_mean, "prediction": prediction, "truth": label, "uncertainty": uncertainty, "correct": correct}
        k+=1
    
    return output
```

## Training

Our model was trained on CIFAR-10 using the `fastai` API. CIFAR-10 contains 60.000 labelled 32x32x3 color images belonging to 10 different classes. The input data was split into a training set of 50.000 images and a test set of 10.000 images. The training set was further split into a training set and a validation set. 

The network was trained for 60 epochs with a learning rate of $0.001$ (found using `lr_find` functionality in `fastai`), `kernel size = (5,5)`, `drop_rate = .5` and `weight_decay = 0.0005`. It is identical in structure to the one used by Gal et. al. [1]. The **validation loss** was **0.71280** at end of training with an **accuracy of 0.76137** on the validation data.

## Data

The data contains the following variables after it has been prepared for analysis in R:

* `correct` (logical): indicator the is `TRUE` if the predicted class label matches the true class label, else `FALSE`.

* `prediction` (int): predicted class label.

* `truth` (int): true class label.

* `uncertainty` (dbl): empirical standard deviation of softmax values for predicted class.

* `prob1` (dbl): argmax of mean softmax output, i.e. mean probability of predicted class.

* `prob2` (dbl): mean probability of runner-up prediction.

* `class2` (int): class label of runner-up prediction.

* `diff` (dbl): `prob1`-`prob2`

* `diff_sd_ratio` (dbl): `diff/uncertainty`.

All the variables above are pretty standard, with the exception of `diff_sd_ratio`. Intuitively, if `diff` is large, the averaged models all agree that class $k$ is the correct prediction. If `diff` is small, the models sampled by MC dropout don't agree on a single class. Thus `diff` also serves as a proxy for uncertainty. Model `uncertainty`, however, is approximated by the empirical standard deviation of the predictions for class $k$. Thus `diff_sd_ratio` is expressed by $$\tau_{kj} = \frac{\hat{\mu}_k - \hat{\mu}_j}{\hat{\sigma}_k}$$ where $j$ is the runner-up prediction. $\tau_{jk}$ gives us a ratio of two different measures of uncertainty.

# Classification Results

The following table summarises the results over the entire data set:

```{r}
# Importing data
data <- read.csv("~/Documents/Masteroppgave/Data/Resultater/lenet-model55.csv")
df <- dplyr::select(data, -X)

# Summary statistics over the entire data set
results <- df %>%
  summarise(n = n(),
            accuracy = sum(correct == 1)/n,
            mean_prob1 = mean(prob1),
            mean_prob2 = mean(prob2),
            mean_uncertainty = mean(uncertainty),
            median_uncertainty = median(uncertainty),
            sd_uncertainty = sd(uncertainty),
            mean_diff = mean(diff),
            median_diff = median(diff),
            sd_diff = sd(diff),
            mean_ratio = mean(diff_sd_ratio),
            median_ratio = median(diff_sd_ratio),
            sd_ratio = sd(diff_sd_ratio))
results
```
The next table shows the summary statistics after partitioning the data by incorrect/correct classifications:
```{r}
# Aggregating summary statistics by correct/incorrect
agg_df <- df %>% 
  group_by(correct) %>% 
  summarise(n = n(),
            proportion = n/10000,   # share of all 10,000 test images
            mean_prob1 = mean(prob1),
            mean_prob2 = mean(prob2),
            mean_uncertainty = mean(uncertainty),
            median_uncertainty = median(uncertainty),
            sd_uncertainty = sd(uncertainty),
            mean_diff = mean(diff),
            median_diff = median(diff),
            sd_diff = sd(diff),
            mean_ratio = mean(diff_sd_ratio),
            median_ratio = median(diff_sd_ratio),
            sd_ratio = sd(diff_sd_ratio))
agg_df
```

Overall our model classifies `r agg_df$n[2]` images correctly out of $N = 10{,}000$, giving it an accuracy of `r (agg_df$n[2]/10000)*100`%. In brief, the summary statistics indicate that the softmax outputs of the predicted and runner-up classes tend to be closer to each other when the model misclassifies. Furthermore, incorrectly classified images are associated with greater uncertainty on average. This is a crucial result: it is a necessary condition for the uncertainty approximation to carry information about whether a prediction is correct.

The model is most successful when classifying images of automobiles (902 correct) and frogs (891 correct). The least successful predictions occur in the cat category, with 565 correct classifications. This is summarized in the following confusion matrix:

![](Figures/conf_mat_dat259.jpg)

The confusion matrix above provides details of exactly how the model fails. In images of cats the most frequent misclassification is dog (145). For dogs the most frequent incorrect label is cat (135). In general the misclassifications, although wrong, tend to belong to the same domain as the correct label (i.e. one vehicle type is mistaken for another, and one species of animal is misclassified as another). This suggests that our model is unable to learn distinguishing features that adequately separate similar classes. However, one should be cautious about reading too much into the type of errors being made; the results may very well be an artefact of the model, the data set or both.

## Examples of Uncertain Classifications {.tabset}

In the following we will visualize the relationships between our variables. We start by looking at examples of certain and uncertain images. Then we examine the empirical distribution of the uncertainty estimates $\hat{\sigma}_k$.

### Incorrect classifications

The following plots were generated using Python (code available on [GitHub](link)). On the left-hand side we see the unnormalized image with the corresponding ground truth label. The plot in the middle shows the softmax output of the predicted class for each of the $T=100$ stochastic forward passes; $\hat{\mu}_k$ is given by the solid red line and $\hat{\mu}_j$ by the dashed brown line. The plot title shows both the predicted class and the runner-up class. To the right is a kernel density estimate (using a Gaussian kernel) of the $T=100$ softmax outputs for the predicted class. The plots show the five most uncertain classifications in the entire data set.

![](Figures/8601.jpeg)
![](Figures/5302.jpeg)
![](Figures/4340.jpeg)
![](Figures/5572.jpeg)
![](Figures/1248.jpeg)

### Correct classifications

![](Figures/2579.jpeg)
![](Figures/9468.jpeg)
![](Figures/8169.jpeg)
![](Figures/7778.jpeg)
![](Figures/1079.jpeg)

The runner-up prediction for each of the five most uncertain but correctly classified images is "airplane". One possible explanation is the presence of large areas of blue and/or white, which we might expect to appear as backgrounds in images of planes flying through the sky.

## Distribution of Uncertainty

The distribution of the uncertainty estimates appears to be bimodal, with peaks close to 0 and 0.2:

```{r fig.height=4, fig.width=6}
# Distribution of estimated uncertainty
p1 <- df %>% 
  ggplot(aes(x=uncertainty)) +
  geom_histogram(col="grey", bins = 50, alpha=.5) +
  ggtitle("Distribution of estimated uncertainty") +
  theme_bw()
p1
```

By grouping the uncertainty estimates by `correct` (i.e. if the label was correctly predicted or not), we can find out how the predictions contribute to the uncertainty distribution:

```{r}
#Distribution of estimated uncertainty by prediction
p2 <- df %>% 
  ggplot(aes(x=uncertainty, col=as.factor(correct))) +
  geom_freqpoly(alpha=.7) +
  ggtitle("Distribution of estimated uncertainty by classification") +
  scale_color_discrete(name="Prediction",
                         breaks=c("0", "1"),
                         labels=c("0: Incorrect", "1: Correct")) +
  theme_bw()
p2
```
The blue line corresponds to the correct predictions, the red line to the incorrect predictions. We see that the incorrect predictions are centered around a higher uncertainty, whereas far more of the correctly predicted images are concentrated around a low uncertainty value. The incorrect classifications contribute greatly to the bimodality, but it is also present in the distribution of uncertainty for the correct classifications.

The following density plots give us an idea of how the distributions compare to each other:

```{r}
# KDE by correct prediction
lines_df <- df %>% 
  group_by(correct) %>% 
  summarise(mean_uncertainty=mean(uncertainty))

p3 <- df %>% 
  ggplot(aes(x=uncertainty, group=factor(correct))) +
  geom_histogram(aes(y=..density.., fill=factor(correct)), position="identity", bins = 50, alpha=.5) +
  geom_density(aes(y=..density.., col=factor(correct)), alpha=.5) +
  geom_rug(aes(x=uncertainty, col=factor(correct)), alpha=.5) +
  geom_vline(data = lines_df, aes(xintercept=mean_uncertainty, col=factor(correct)), lty="dashed") +
  ggtitle("Distributions of uncertainty by classification") +
  labs(color="Prediction",
       fill="Prediction") +
  scale_color_discrete(breaks=c("0", "1"),
                         labels=c("0: Incorrect", "1: Correct")) +
  scale_fill_discrete(breaks=c("0", "1"),
                         labels=c("0: Incorrect", "1: Correct")) +
  theme_bw() +
  theme(legend.position = c(.9,.85))
p3
```

The boxplot gives us yet another way to visualize the difference between the uncertainty distributions of correct and incorrect classifications:

```{r}
# Boxplot of uncertainties for correct vs. incorrect
p4 <- df %>% 
  ggplot(aes(x=as.factor(correct), y=uncertainty)) +
  geom_boxplot(aes(fill=as.factor(correct)), alpha=.7) +
  labs(x="correct") +
  ggtitle("Boxplot of uncertainty distribution by correct/incorrect") +
  scale_fill_discrete(name="Prediction",
                         breaks=c("0", "1"),
                         labels=c("0: Incorrect", "1: Correct")) + 
  theme_bw()
p4
```

If $\hat{\sigma}_k$ is a useful approximation of model uncertainty, then we would expect the distributions of $\hat{\sigma}_k$ for correctly and incorrectly classified images to reflect this somehow. From the figures above, it seems clear that the two distributions are meaningfully different from one another, at least for this particular choice of model and data set. This indicates that the uncertainty approximation itself may contain valuable information about the model's confidence in its own predictions.
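
One simple way to probe this impression a little more formally is a nonparametric two-sample test; the chunk below sketches a Wilcoxon rank-sum test comparing the uncertainty values of the two groups:

```{r}
# Wilcoxon rank-sum test: do the uncertainty distributions of correctly
# and incorrectly classified images differ in location?
wilcox.test(uncertainty ~ factor(correct), data = df)
```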

## Class-specific uncertainty

If $\hat{\sigma}_k$ captures model uncertainty, then we would expect the class-specific distributions of $\hat{\sigma}_k$ to somehow echo the confusion matrix in section 3. Let $\hat{\sigma}_j$ denote the class-specific uncertainty, where $j \in \{\text{airplane},...,\text{truck}\}$ is the corresponding class label.

```{r}
to_string <- as_labeller(c(
  "0" = "airplane", 
  "1" = "automobile", 
  "2" = "bird",
  "3" = "cat", 
  "4" = "deer", 
  "5" = "dog", 
  "6" = "frog", 
  "7" = "horse", 
  "8" = "ship", 
  "9" = "truck"))

unc_df <- df %>% 
  group_by(truth) %>% 
  summarise(mean_uncertainty = mean(uncertainty))

mean_df <- as.tibble(rep(mean(df$uncertainty), 10))

hist_class <- df %>% 
  ggplot(aes(x=uncertainty)) +
  geom_histogram(aes(y=..density.., fill=factor(truth)), alpha=.7, bins=50) +
  geom_vline(data=unc_df, aes(xintercept = mean_uncertainty, col=factor(truth)), lwd=.5, lty="dashed") +
  geom_vline(data=mean_df, aes(xintercept = value), lwd=.3, lty="dotted") +
  facet_wrap(~truth, nrow = 2, ncol = 5, drop=FALSE, labeller = to_string) +
  theme_bw() +
  theme(legend.position = "none") +
  ggtitle("Distributions of class uncertainty")

hist_class
```

As noted in section 3, the model clearly mistakes some cats for dogs and some dogs for cats. The confusion seems to be reflected in the relative distributions of $\hat{\sigma}_{\text{cat}}$ and $\hat{\sigma}_{\text{dog}}$.

The following boxplots give an idea of the spread of $\hat{\sigma}_j$ for each class $j$ after grouping by classification status:

```{r}
to_string <- as_labeller(c(
  "0" = "airplane", 
  "1" = "automobile", 
  "2" = "bird",
  "3" = "cat", 
  "4" = "deer", 
  "5" = "dog", 
  "6" = "frog", 
  "7" = "horse", 
  "8" = "ship", 
  "9" = "truck"))

unc_df <- df %>% 
  group_by(truth) %>% 
  summarise(mean_uncertainty = mean(uncertainty))

box_class <- df %>% 
  ggplot(aes(y=uncertainty)) +
  geom_boxplot(aes(x=factor(correct), fill=factor(correct)), alpha=.7) +
  facet_wrap(~truth, nrow=2, ncol=5, labeller = to_string) +
  theme_bw() +
  theme(legend.position = "none") +
  ggtitle("Boxplots of class uncertainty") +
  labs(x="correct")

box_class
```

In summary, the class-specific approximations of uncertainty are consistent with the confusion matrix in section 3: higher uncertainty seems to be associated with categories that tend to be mislabelled.

## Relationship to other variables

As $\hat{\mu}_k$ decreases, the softmax probabilities are spread out across the elements of the predictive mean vector $\hat{\mu}$. Another view is that the number of candidate classes increases. Naively, we would therefore expect lower values of $\hat{\mu}_k$ to be associated with higher uncertainty. 

The following figure shows the relationship between $\hat{\sigma}_k$ and $\hat{\mu}_k$ (we have plotted a smoothed estimate of the mean uncertainty as a function of the `prob1` output to make this clearer): 

```{r}
# Plotting softmax output of prediction against estimated uncertainty
p5 <- df %>% 
  ggplot(aes(x=prob1, y=uncertainty)) +
  geom_point(alpha=.2) +
  geom_smooth() +
  ggtitle("Uncertainty vs. softmax prediction") +
  labs(x="softmax output of predicted class") +
  theme_bw()
p5
```

The plot indicates a concave shape, which contradicts our naive assumption. A possible explanation is that $\hat{\mu}_k$ is bounded below by $0.1$: since $\hat{\mu}_k = \max_j \hat{\mu}_j$ and the ten entries of $\hat{\mu}$ sum to one, the largest entry can never fall below $0.1$. As $\hat{\mu}_k \rightarrow 0.1$, all the other elements of the predictive mean vector also approach $0.1$. In this regime we speculate that the model yields similar, near-uniform softmax outputs on each stochastic forward pass, which leads to a small empirical standard deviation and hence a lower estimate of $\hat{\sigma}_k$. This highlights a potential drawback of the ad hoc approximation given in section 1.1: a classification that results in maximum uncertainty with respect to a measure such as entropy (i.e. when $\hat{\mu}_j = 0.1$ for all $j$) could be associated with a small empirical standard deviation $\hat{\sigma}_k$.
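
A small simulation illustrates this caveat, assuming (purely for illustration) that every stochastic forward pass returns a softmax vector close to uniform:

```{r}
# Near-uniform softmax outputs on every forward pass: the prediction is
# maximally ambiguous, yet the empirical standard deviation of the
# predicted class is tiny.
set.seed(1)
T_passes <- 100
near_uniform <- matrix(0.1 + rnorm(T_passes * 10, sd = 0.005), ncol = 10)
near_uniform <- near_uniform / rowSums(near_uniform)  # renormalise each row
pred <- which.max(colMeans(near_uniform))
sd(near_uniform[, pred])
```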

Furthermore, $\hat{\sigma}_k$ seems to peak around `prob1` $\approx 0.5$. Additional information can be leveraged by colour-coding the data points based on the value of the runner-up prediction `prob2`, as shown in the following figure:

```{r}
# Plotting softmax output of prediction against estimated uncertainty, coloured by softmax output of runner-up
p7 <- df %>% 
  ggplot(aes(x=prob1, y=uncertainty)) +
  geom_point(aes(col=prob2), shape=21, alpha=.7) +
  geom_point(data=agg_df, aes(x=mean_prob1, y=mean_uncertainty, shape=as.logical(correct)), size=2.5) +
  #geom_point(data=agg_df, aes(y=uncertainty, x=mean_prob1, shape=as.logical(correct)), size=2) +
  ggtitle("Uncertainty vs. softmax output of prediction for all observations") +
  labs(x="softmax output of predicted class", shape="correct") +
  scale_color_distiller(name="Runner up",
                        palette = "Spectral") +
  scale_shape_discrete(name=expression(paste("(",bar(mu)["pred"],", ",bar(sigma)["pred"],")")),
                       labels=c("Incorrect", "Correct")) +
  theme_bw()
p7
```

A blue point corresponds to an observation with a low value of `prob2`, whereas a red point corresponds to an observation with a high value of `prob2`. Additionally, we have plotted the mean values of `prob1` and `uncertainty` for the incorrectly (black circle) and correctly (black triangle) classified observations.

Clearly, the largest values of `prob2` cluster around `prob1` $\approx 0.5$. This is to be expected: since `prob2` can never exceed `prob1` and the two must sum to at most one, we have `prob2` $\leq \min($`prob1`$, 1 - $`prob1`$)$, which is maximised at `prob1` $= 0.5$, where the most "left-over" probability is available for distribution among the other classes in $\hat{\mu}$. In other words, the value of `prob2` is constrained by `prob1`. However, the maximum values of $\hat{\sigma}_k$ appear to be associated with relatively large pairs of `prob1` and `prob2`. This suggests that uncertainty estimation by ad hoc methods such as $\hat{\sigma}_k$ could be most useful in binary classification settings, or for detecting situations with a small number of competing classes in a multiclass setting.

The observations can be partitioned by classification status and overlaid with a 2D kernel density estimate (KDE). The KDE gives an idea of the joint distribution of `prob1` and `uncertainty`:

```{r fig.width=10, fig.height=5}
# Plotting uncertainty vs. softmax output of prediction
p8 <- df %>% 
  ggplot(aes(x=prob1, y=uncertainty)) +
  geom_point(aes(col=prob2),alpha=0.5) +
  geom_density_2d(col="black", alpha=.3) +
  ggtitle("Uncertainty vs. softmax output of prediction by incorrect/correct") +
  labs(x="softmax output of prediction") +
  facet_grid(.~as.factor(correct)) +
  scale_color_distiller(name="Runner up",
                        palette = "Spectral") +
  theme_bw()
p8
```

The black contour lines indicate where most of the points are concentrated. The plot on the left shows the incorrect predictions, the plot on the right the correct predictions. For the correct predictions, it seems that far more of the points are concentrated in the region of high predicted output, low runner-up output and low uncertainty. For the incorrect classifications, most of the points are concentrated around the area where $\hat{\sigma}_k$ peaks. This indicates that the approximated uncertainty estimates indeed contain valuable information in the incorrect cases.

### Runner-up predictions

The following plot shows the relationship between uncertainty estimates and the softmax output of the runner-up prediction. Unsurprisingly, model uncertainty increases as the softmax output of the runner-up increases. We have plotted a smoothed estimate of the mean uncertainty as a function of the runner-up output to make this clearer:

```{r}
# Plotting softmax output of runner-up against estimated uncertainty
p10 <- df %>% 
  ggplot(aes(x=prob2, y=uncertainty)) +
  geom_point(alpha=.2) +
  geom_smooth(method="loess") +
  labs(x="softmax output of runner-up") +
  ggtitle("Uncertainty vs. softmax output of runner-up") +
  theme_bw()
p10
```

## Uncertainty-prediction correlation

As mentioned in section 2.2, the connection established between dropout neural networks and GPs is lost when applied to convolutional neural networks, so there is no formal guarantee that $\hat{\sigma}_k$ is meaningful in our setting. Performing a logistic regression gives us a simple way of testing whether the approximated uncertainty is a significant predictor of the model's ability to classify correctly:

```{r}
# Fitting logistic regression model to check significance of uncertainty
model_sd <- glm(as.factor(correct)~uncertainty, data=df, family = binomial(link="logit"))
summary(model_sd)
```

The coefficient associated with `uncertainty` is highly significant, indicating that the estimates leveraged from MC dropout may indeed be a useful quantification of predictive uncertainty.
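
To translate the fitted coefficient into something more tangible, we can sketch the predicted probability of a correct classification at a few uncertainty levels; the values $0.05$, $0.15$ and $0.25$ below are arbitrary illustrative choices:

```{r}
# Predicted probability of a correct classification at a few illustrative
# uncertainty values, based on the fitted logistic regression
predict(model_sd,
        newdata = data.frame(uncertainty = c(0.05, 0.15, 0.25)),
        type = "response")
```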

# Referral Criteria

The question is then: How do we determine a reasonable uncertainty threshold for referral to a human expert? As we saw in section 3.1.2, the mean uncertainty of the incorrect predictions is higher than that of the correct predictions (`r agg_df$mean_uncertainty[1]` vs. `r agg_df$mean_uncertainty[2]`, respectively).

Naively, we could set the threshold for referral to the mean uncertainty of the incorrectly classified images:

```{r}
# Counting number of correct/incorrect by uncertainty >= .18 
mean_unc <- df %>% 
  filter(correct==0) %>% 
  summarise(mean_unc=mean(uncertainty))

referral <- df %>% 
  filter(uncertainty>=mean_unc[1,1]) %>% 
  count(correct)
referral
```

The problem here is apparent: many of the correctly classified images are associated with a relatively high level of uncertainty (in our case, this is due to the bimodality of the uncertainty distribution in section 3.2.1). We speculate that although uncertainty seems to contain valuable information, the difference in uncertainty between the correct and incorrect observations may (in this case) be too small to determine which images should be referred to an expert. We may need to amplify the quantification of uncertainty in some way.

## Usefulness of runner-up predictions

For the most uncertain incorrectly classified images the runner-up suggestions are nonsensical. Still, is there any information to be gained from the runner-up predictions? What would happen to our overall accuracy if we used the runner-up predictions for all incorrect classifications?

```{r}
# Counting classification accuracy if runner-up is equal to ground truth
class2_df <- df %>%
  mutate(correct=replace(correct, correct==0 & class2==truth, 1)) %>% 
  count(correct)
class2_df
```

Accuracy would rise to `r class2_df$n[2]/(sum(class2_df$n))`. This indicates that there may be some valuable information to be gathered from the runner-up predictions. 

```{r}
runnerup_df <- df %>% 
  filter(correct==0) %>% 
  mutate(runnerup = ifelse(class2==truth, 1, 0))

mean_runnerup <- runnerup_df %>%
  group_by(runnerup) %>% 
  summarise(meanuncertainty = mean(uncertainty),
            meanprob1 = mean(prob1),
            meanprob2 = mean(prob2),
            n=n())
```

```{r}
p11 <- runnerup_df %>% 
  ggplot(aes(x=uncertainty, group=factor(runnerup))) +
  geom_histogram(aes(y=..density.., fill=factor(runnerup)), position="identity", bins = 50, alpha=.5) +
  geom_density(aes(y=..density.., col=factor(runnerup)), alpha=.5) +
  geom_rug(aes(x=uncertainty, col=factor(runnerup)), alpha=.5) +
  geom_vline(data = mean_runnerup, aes(xintercept=meanuncertainty, col=factor(runnerup)), lty="dashed") +
  ggtitle("Distributions of uncertainty") +
  labs(color="Runner-up",
       fill="Runner-up") +
  scale_color_discrete(breaks=c("0", "1"),
                         labels=c("0: Incorrect", "1: Correct")) +
  scale_fill_discrete(breaks=c("0", "1"),
                         labels=c("0: Incorrect", "1: Correct")) +
  theme_bw() +
  theme(legend.position = c(.9,.85))
p11
```

```{r}
p12 <- runnerup_df %>% 
  ggplot(aes(x=prob1, group=factor(runnerup))) +
  geom_density(aes(y=..density.., col=factor(runnerup)), alpha=.5) +
  geom_rug(aes(x=prob1, col=factor(runnerup)), alpha=.5) +
  geom_vline(data = mean_runnerup, aes(xintercept=meanprob1, col=factor(runnerup)), lty="dashed") +
  ggtitle("Distributions of softmax predictions") +
  labs(color="Runner-up",
       fill="Runner-up") +
  scale_color_discrete(breaks=c("0", "1"),
                         labels=c("0: Incorrect", "1: Correct")) +
  scale_fill_discrete(breaks=c("0", "1"),
                         labels=c("0: Incorrect", "1: Correct")) +
  theme_bw() +
  theme(legend.position = c(.9,.85))
p12
```


```{r}
p13 <- runnerup_df %>% 
  ggplot(aes(x=prob2, group=factor(runnerup))) +
  geom_density(aes(y=..density.., col=factor(runnerup)), alpha=.5) +
  geom_rug(aes(x=prob2, col=factor(runnerup)), alpha=.5) +
  geom_vline(data = mean_runnerup, aes(xintercept=meanprob2, col=factor(runnerup)), lty="dashed") +
  ggtitle("Distributions of runner up softmax probabilities") +
  labs(color="Runner-up",
       fill="Runner-up") +
  scale_color_discrete(breaks=c("0", "1"),
                         labels=c("0: Incorrect", "1: Correct")) +
  scale_fill_discrete(breaks=c("0", "1"),
                         labels=c("0: Incorrect", "1: Correct")) +
  theme_bw() +
  theme(legend.position = c(.9,.85))
p13
```

So how do we determine which images to take a closer look at?

## Leveraging Runner-up Probabilities

One possible approach is to incorporate information from the value of $\tau_{kj}$, which we introduced in section 2.5. Consider the following cases (a rough categorisation along these lines is sketched in the code after the list):

* $\tau_{kj} \approx 1$ means that `diff` and `uncertainty` are relatively similar. This happens if a) the models have failed to agree on a clear favourite (`diff` is small) but model uncertainty is low, or b) the models have a favourite (`diff` is large) but model uncertainty is high. Let's call these **referral predictions**.

* $\tau_{kj} \to 0$ means that `uncertainty` is much larger than `diff`. These should represent **uncertain predictions**.

* $\tau_{kj} \to \infty$ means that `uncertainty` is much smaller than `diff`. These should represent **non-referral predictions**.
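
The chunk below is a minimal sketch of such a categorisation; the cut points $0.5$ and $2$ are purely illustrative and are not the thresholds used in the remainder of this section:

```{r}
# Rough categorisation of all observations by diff_sd_ratio;
# the cut points are illustrative only
df %>%
  mutate(category = case_when(
    diff_sd_ratio <= 0.5 ~ "uncertain",
    diff_sd_ratio <= 2   ~ "referral",
    TRUE                 ~ "non-referral")) %>%
  count(category, correct)
```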

As we saw in section 3, the mean $\tau_{kj}$ (given by `mean_ratio`) is `r agg_df$mean_ratio[1]` for the incorrectly classified images and `r agg_df$mean_ratio[2]` for the correctly classified images. Setting the referral threshold to the mean $\tau_{kj}$ of the incorrect classifications (`r agg_df$mean_ratio[1]`), we get:

```{r}
# Counting number of correct/incorrect by tau <= 2.8
mean_tau <- df %>% 
  filter(correct==0) %>% 
  summarise(mean_tau=mean(diff_sd_ratio))
  
referral_mean <- df %>% 
  filter(diff_sd_ratio <= mean_tau[1,1]) %>% 
  count(correct)
referral_mean
```

We run into the same problem as when we used the mean uncertainty: many of the correctly predicted images would be referred, in fact more than when we used the uncertainty estimates. It is interesting to note, however, that we also get far more referrals of incorrectly classified images. Again, this may indicate that $\tau_{kj}$ is a useful measure of uncertainty.

Setting the referral threshold to $\tau_{kj} \leq 1$ gives us the following:

```{r}
# Counting number of correct/incorrect by tau <= 1
referral_1 <- df %>% 
  filter(diff_sd_ratio <= 1) %>% 
  count(correct)
referral_1
```

This is a clear improvement over our first approach in terms of the number of referred images: `r referral_mean$n[1] - referral_1$n[1]` fewer incorrect images are referred, compared to `r referral_mean$n[2] - referral_1$n[2]` fewer correct images. This could indicate that $\tau_{kj}$ is a useful quantification of uncertainty.

# References

[1] Gal, Y. & Ghahramani, Z. *Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference*. arXiv: 1506.02158 (2015).

[2] Gal, Y. & Ghahramani, Z. *Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning*. arXiv: 1506.02142 (2015).

[3] Gal, Y. & Ghahramani, Z. *Dropout as a Bayesian Approximation: Appendix*. arXiv: 1506.02157 (2015).

[4] Leibig, C., Allken, V., et al. *Leveraging uncertainty information from deep neural networks for disease detection*. Scientific Reports, volume 7, Article number: 17816 (2017).

[5] Hinton, G., Srivastava, N., et al. *Improving neural networks by preventing co-adaptation of feature detectors*. arXiv: 1207.0580 (2012).

[6] Leibig, C., Allken, V., et al. *Leveraging uncertainty information from deep neural networks for disease detection: supplementary material*. Scientific Reports, volume 7, Article number: 17816 (2017).

[7] Smith, L. & Gal, Y. *Understanding Measures of Uncertainty for Adversarial Example Detection*. arXiv: 1803.08533 (2018).